933 research outputs found

    A Benchmark Suite for Template Detection and Content Extraction

    Full text link
    Template detection and content extraction are two of the main areas of information retrieval applied to the Web. They perform different analyses over the structure and content of webpages to extract some part of the document, but their objectives differ: template detection identifies the template of a webpage (usually by comparing it with other webpages of the same website), whereas content extraction identifies the main content of the webpage and discards the rest. The two tasks are therefore complementary, because the main content is not part of the template. It has been measured that templates represent between 40% and 50% of the data on the Web, so identifying templates is essential for indexing tasks: templates usually contain irrelevant information such as advertisements, menus, and banners, and processing and storing this information wastes resources (storage space, bandwidth, etc.). Similarly, identifying the main content is essential for many information retrieval tasks. In this paper, we present a benchmark suite to test different approaches for template detection and content extraction. The suite is public, and it contains real, heterogeneous webpages that have been labelled so that different techniques can be suitably (and automatically) compared. Comment: 13 pages, 3 tables.
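    As a rough illustration of the kind of technique such a benchmark would evaluate, the sketch below treats DOM element paths shared by two pages of the same website as template candidates, since templates repeat across pages. It is a naive toy written for this summary, not one of the benchmarked approaches; the example pages and the path heuristic are assumptions.

```python
# Toy sketch of DOM-based template detection, NOT a benchmarked
# technique: element paths appearing in two pages of the same website
# are taken as template candidates, since templates repeat across pages.
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class PathCollector(HTMLParser):
    """Records the tag path of every element in a page."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID:          # void elements have no end tag
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

def shared_template_paths(page_a: str, page_b: str) -> set:
    """Paths present in both pages are likely part of the template."""
    sets = []
    for page in (page_a, page_b):
        collector = PathCollector()
        collector.feed(page)
        sets.append(collector.paths)
    return sets[0] & sets[1]

# Two hypothetical pages of one site: the shared <nav> path is flagged
# as template; the page-specific <article> and <div> are not.
a = "<html><body><nav>menu</nav><article>story A</article></body></html>"
b = "<html><body><nav>menu</nav><div>about us</div></body></html>"
print(shared_template_paths(a, b))
```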

    An Analysis of the Current Program Slicing and Algorithmic Debugging Based Techniques

    Full text link
    This thesis presents a classification of program slicing based techniques. The classification allows us to identify the differences between existing techniques, and it also allows us to predict new slicing techniques. The study identifies and compares the dimensions that influence current techniques. Silva Galiana, JF. (2008). An Analysis of the Current Program Slicing and Algorithmic Debugging Based Techniques. http://hdl.handle.net/10251/14300
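    For readers unfamiliar with the core notion the thesis classifies, the toy below computes a backward slice over data dependences only. The five-statement program, its def/use encoding, and the omission of control dependences are simplifications made up for this example; they do not reproduce any technique surveyed in the thesis.

```python
# Toy backward-slicing sketch (data dependences only, no control
# dependences) to illustrate what a slicing criterion selects.
# Each statement is encoded as (line, defined_vars, used_vars).
program = [
    (1, {"x"}, set()),   # x = input()
    (2, {"y"}, set()),   # y = input()
    (3, {"s"}, {"x"}),   # s = x * x
    (4, {"t"}, {"y"}),   # t = y + 1
    (5, {"r"}, {"s"}),   # r = s + 2
]

def backward_slice(program, criterion_line, criterion_vars):
    """Lines that may affect `criterion_vars` at `criterion_line`."""
    needed = set(criterion_vars)
    in_slice = set()
    for line, defs, uses in reversed(program):
        if line > criterion_line:
            continue
        if defs & needed:            # this line defines a needed variable
            in_slice.add(line)
            needed = (needed - defs) | uses
    return sorted(in_slice)

# Slicing on <line 5, {r}> keeps lines 1, 3, 5; lines 2 and 4 are
# irrelevant to r and are sliced away.
print(backward_slice(program, 5, {"r"}))  # -> [1, 3, 5]
```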

    Page-Level Main Content Extraction from Heterogeneous Webpages

    Full text link
    The main content of a webpage is often surrounded by other boilerplate elements related to the template, such as menus, advertisements, copyright notices, and comments. For crawlers and indexers, isolating the main content from the template and other noisy information is an essential task, because processing and storing noisy information wastes resources such as bandwidth, storage space, and computing time. Besides, the detection and extraction of the main content is useful in different areas, such as data mining, web summarization, and content adaptation to low resolutions. This work introduces a new technique for main content extraction. In contrast to most techniques, it extracts not only text but also other types of content, such as images and animations. It is a Document Object Model-based, page-level technique, so it only needs to load a single webpage to extract its main content. As a consequence, it is efficient enough to be used online (in real time). We have empirically evaluated the technique using a suite of real, heterogeneous benchmarks, obtaining very good results compared with other well-known content extraction techniques.

    This work has been partially supported by the EU (FEDER) and the Spanish MCI/AEI under grants TIN2016-76843-C4-1-R and PID2019-104735RB-C41, by the Generalitat Valenciana under grant Prometeo/2019/098 (DeepTrust), and by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No 952215.

    Alarte, J.; Silva, J. (2021). Page-Level Main Content Extraction from Heterogeneous Webpages. ACM Transactions on Knowledge Discovery from Data. 15(6):1-21. https://doi.org/10.1145/3451168
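    The abstract does not detail the algorithm, so the sketch below shows a common page-level baseline in the same spirit: build the DOM of a single page and pick the subtree with the best text-to-tag ratio among those holding most of the page's text. This is an illustrative stand-in, not the method of Alarte and Silva; the 0.5 text threshold and the density score are assumptions of the sketch.

```python
# Sketch of a common DOM-based, page-level baseline, NOT this paper's
# algorithm: score each subtree by its text-to-tag ratio and pick the
# densest one that still contains most of the page's text.
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}

class Node:
    def __init__(self, tag, parent):
        self.tag, self.parent = tag, parent
        self.text_len, self.tag_count = 0, 1

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document", None)
        self.cur = self.root
        self.nodes = [self.root]          # document order: parents first

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.nodes.append(node)
        if tag not in VOID:               # void elements have no end tag
            self.cur = node

    def handle_endtag(self, tag):
        if self.cur.parent is not None:
            self.cur = self.cur.parent

    def handle_data(self, data):
        length = len(data.strip())
        node = self.cur
        while node is not None:           # text counts for all ancestors
            node.text_len += length
            node = node.parent

def main_content_node(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    for node in reversed(builder.nodes):  # children before parents
        if node.parent is not None:
            node.parent.tag_count += node.tag_count
    total = builder.root.text_len
    # densest subtree that still holds most of the page's text
    candidates = [n for n in builder.nodes
                  if total and n.text_len > 0.5 * total] or [builder.root]
    return max(candidates, key=lambda n: n.text_len / n.tag_count)
```

    Because the heuristic only inspects the DOM of the page being processed, it needs no other pages from the site, which is what makes a page-level technique fast enough to run at crawl time.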

    Adjusting Correlation Matrices

    Get PDF
    The article proposes a new algorithm for adjusting correlation matrices and compares it with Finger's algorithm, which is used to compute Value-at-Risk in RiskMetrics for stress-test scenarios. The solution proposed by the new methodology is always better than Finger's approach in the sense that it alters as little as possible the correlations that we do not wish to change, while the remaining correlations change only as needed to obtain a consistent correlation matrix. Keywords: Stochastic, Volatility, Skewness, Kurtosis, Pricing.
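    The sketch below illustrates the underlying problem rather than either algorithm from the article: stressing individual correlations can make the matrix lose positive semidefiniteness, so it must be repaired into a nearby valid correlation matrix. It uses Higham's alternating-projections method, a standard "nearest correlation matrix" technique, with an illustrative three-asset stress scenario; the iteration count is an arbitrary assumption.

```python
# Sketch of the problem, NOT the article's algorithm nor Finger's:
# a stressed correlation matrix may lose positive semidefiniteness,
# so we repair it with Higham's alternating projections.
import numpy as np

def nearest_correlation(A, iters=100):
    """Alternate projections onto PSD and unit-diagonal matrices."""
    Y = A.copy()
    dS = np.zeros_like(A)
    for _ in range(iters):
        R = Y - dS                                    # Dykstra correction
        w, V = np.linalg.eigh(R)
        X = V @ np.diag(np.clip(w, 0.0, None)) @ V.T  # PSD projection
        dS = X - R
        Y = X.copy()
        np.fill_diagonal(Y, 1.0)                      # unit diagonal
    return Y

# Illustrative stress scenario: these three pairwise correlations are
# mutually inconsistent, so one eigenvalue is negative.
C = np.array([[ 1.0,  0.9,  0.9],
              [ 0.9,  1.0, -0.9],
              [ 0.9, -0.9,  1.0]])
print(np.linalg.eigvalsh(C))     # exposes the negative eigenvalue
print(nearest_correlation(C))    # nearby consistent correlation matrix
```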

    Materiales y recursos digitales

    Get PDF
    The supply of digital educational tools, materials, and content is unlimited, and it forces teachers to search for, select, and store them in a way very different from the one they were accustomed to. What resources are available? How can they be organized? And how can they be shared?